Abstract

Many researchers have explored methods for hierarchical reinforce- 
ment learning (RL) with temporal abstractions, in which abstract 
actions are defined that can perform many primitive actions before 
terminating. However, little is known about learning with state ab- 
stractions, in which aspects of the state space are ignored. In previ- 
ous work, we developed the MAXQ method for hierarchical RL. In 
this paper, we define five conditions under which state abstraction 
can be combined with the MAXQ value function decomposition. 
We prove that the MAXQ-Q learning algorithm converges under 
these conditions and show experimentally that state abstraction is 
important for the successful application of MAXQ-Q learning. 
Category: Reinforcement Learning and Control 
Preference: Oral 



1 Introduction 

Most work on hierarchical reinforcement learning has focused on temporal abstrac- 
tion. For example, in the Options framework the programmer defines a set of 
macro actions ("options") and provides a policy for each. Learning algorithms (such 
as semi-Markov Q learning) can then treat these temporally abstract actions as if 
they were primitives and learn a policy for selecting among them. Closely related 
is the HAM framework, in which the programmer constructs a hierarchy of
finite-state controllers. Each controller can include non-deterministic states
(where the programmer was not sure what action to perform). The HAMQ learning
algorithm can then be applied to learn a policy for making choices in the
non-deterministic states. In both of these approaches, and in other studies of
hierarchical RL, each option or finite-state controller must have access to the
entire state space. The one exception, the Feudal-Q method of Dayan and Hinton,
introduced state abstractions in an unsafe way such that the resulting learning
problem was only partially observable. Hence, they could not provide any formal
results for the convergence or performance of their method.

Even a brief consideration of human-level intelligence shows that such methods
cannot scale. When deciding how to walk from the bedroom to the kitchen, we do
not need to think about the location of our car. Without state abstractions, any RL
method that learns value functions must learn a separate value for each state of the 
world. Some argue that this can be solved by clever value function approximation 
methods — and there is some merit in this view. In this paper, however, we explore 
a different approach in which we identify aspects of the MDP that permit state ab- 
stractions to be safely incorporated in a hierarchical reinforcement learning method 
without introducing function approximations. This permits us to obtain the first 
proof of the convergence of hierarchical RL to an optimal policy in the presence of 
state abstraction. 

We introduce these state abstractions within the MAXQ framework, but the
basic ideas are general. In our previous work with MAXQ, we briefly discussed state 
abstractions, and we employed them in our experiments. However, we could not 
prove that our algorithm (MAXQ-Q) converged with state abstractions, and we did 
not have a usable characterization of the situations in which state abstraction could 
be safely employed. This paper solves these problems and in addition compares the 
effectiveness of MAXQ-Q learning with and without state abstractions. The results 
show that state abstraction is very important, and in most cases essential, to the 
effective application of MAXQ-Q learning. 



2 The MAXQ Framework 

Let M be a Markov decision problem with states S, actions A, reward function
R(s'|s,a) and probability transition function P(s'|s,a). Our results apply in both
the finite-horizon undiscounted case and the infinite-horizon discounted case. Let
{M_0, ..., M_n} be a set of subtasks of M, where each subtask M_i is defined by a
termination predicate T_i and a set of actions A_i (which may be other subtasks or
primitive actions from A). The "goal" of subtask M_i is to move the environment into
a state such that T_i is satisfied. (This can be refined using a local reward function
to express preferences among the different states satisfying T_i, but we omit this
refinement in this paper.) The subtasks of M must form a DAG with a single "root"
node; no subtask may invoke itself directly or indirectly. A hierarchical policy is
a set of policies π = {π_0, ..., π_n}, one for each subtask. A hierarchical policy
is executed using standard procedure-call-and-return semantics, starting with the
root task M_0 and unfolding recursively until primitive actions are executed. When
the policy for M_i is invoked in state s, let P(s',N|s,i) be the probability that it
terminates in state s' after executing N primitive actions. A hierarchical policy is
recursively optimal if each policy π_i is optimal given the policies of its descendants
in the DAG.

Let V(i, s) be the value function for subtask i in state s (i.e., the value of following
some policy starting in s until we reach a state s' satisfying T_i(s')). Similarly, let
Q(i, s, j) be the Q value for subtask i of executing child action j in state s and
then executing the current policy until termination. The MAXQ value function
decomposition is based on the observation that each subtask M_i can be viewed as a
Semi-Markov Decision problem in which the reward for performing action j in state
s is equal to V(j, s), the value function for subtask j in state s. To see this, consider
the sequence of rewards r_t that will be received when we execute child action j and
then continue with subsequent actions according to hierarchical policy π:

    $$Q(i, s, j) = E\{ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots \mid s_t = s, \pi \}$$

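The macro action j will execute for some number of steps N and then return. Hence,
we can partition this sum into two terms:

    $$Q(i, s, j) = E\Bigl\{ \sum_{u=0}^{N-1} \gamma^u r_{t+u} + \sum_{u=N}^{\infty} \gamma^u r_{t+u} \Bigm| s_t = s, \pi \Bigr\}$$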

The first term is the discounted sum of rewards until subtask j terminates — V(j, s). 
The second term is the cost of finishing subtask i after j is executed (discounted 
to the time when j is initiated). We call this second term the completion function, 
and denote it C(i, s,j). We can then write the Bellman equation as 

    $$\begin{aligned}
    Q(i, s, j) &= \sum_{s', N} P(s', N \mid s, j) \cdot \bigl[ V(j, s) + \gamma^N \max_{j'} Q(i, s', j') \bigr] \\
               &= V(j, s) + C(i, s, j)
    \end{aligned}$$

To terminate this recursion, define V(a, s) for a primitive action a to be the expected 
reward of performing action a in state s. 
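
The decomposition can be read as a recursive evaluation procedure. The following is a minimal Python sketch, not the paper's implementation. The dictionaries `children`, `V_prim`, and `C`, and the tiny two-level hierarchy they describe, are hypothetical and chosen only for illustration. The sketch also assumes the policy at each subtask is greedy, so that V(i, s) is the maximum of Q(i, s, j) over the children j.

```python
# Hypothetical hierarchy: "root" chooses between primitive actions "a" and "b".
children = {"root": ["a", "b"]}                        # composite subtask -> child actions
V_prim = {("a", 0): -1.0, ("b", 0): -2.0}              # V(a, s): expected one-step reward
C = {("root", 0, "a"): -5.0, ("root", 0, "b"): 0.0}    # completion function C(i, s, j)


def value(i, s):
    """V(i, s): stored reward at a primitive action, else the best child Q value."""
    if i not in children:          # primitive action terminates the recursion
        return V_prim[(i, s)]
    return max(q_value(i, s, j) for j in children[i])


def q_value(i, s, j):
    """Q(i, s, j) = V(j, s) + C(i, s, j)."""
    return value(j, s) + C[(i, s, j)]


print(q_value("root", 0, "a"))  # -6.0
print(value("root", 0))         # -2.0
```

Only the completion values and the primitive rewards are stored; every other value is recomputed by walking down the task graph.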

The MAXQ-Q learning algorithm is a simple variation of Q learning in which at
subtask M_i, state s, we choose a child action j and invoke its (current) policy. When
it returns, we observe the resulting state s' and the number of elapsed time steps
N and update C(i, s, j) according to

    $$C(i, s, j) := (1 - \alpha_t) C(i, s, j) + \alpha_t \gamma^N \max_{a'} \bigl[ V(a', s') + C(i, s', a') \bigr].$$
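
As a rough illustration of this update, the sketch below applies it to a single entry of a completion table. It is written under several assumptions that are not in the paper: the function name `maxqq_update`, the use of a `defaultdict` for C, and a caller-supplied function `V(a, s)` for child values are all hypothetical. The sketch also ignores the case where s' is terminal for subtask i, in which the continuation value would be zero.

```python
from collections import defaultdict


def maxqq_update(C, V, i, s, j, s_next, n_steps, child_actions, alpha, gamma):
    """Apply one MAXQ-Q update to the completion value C[(i, s, j)]."""
    # Best value of continuing subtask i from the state in which child j returned.
    best = max(V(a, s_next) + C[(i, s_next, a)] for a in child_actions)
    # Blend the old estimate with the discounted continuation value.
    C[(i, s, j)] = (1 - alpha) * C[(i, s, j)] + alpha * (gamma ** n_steps) * best
    return C[(i, s, j)]


C = defaultdict(float)            # all completion values start at 0


def V(a, s):
    """Hypothetical child value: every child is worth -1 in every state."""
    return -1.0


new_value = maxqq_update(C, V, "root", 0, "a", 1, n_steps=2,
                         child_actions=["a", "b"], alpha=0.5, gamma=0.9)
print(new_value)                  # approximately -0.405
```

In a full learner, `V(a, s)` would itself be computed recursively from the children's completion values, so this one update rule is applied at every level of the hierarchy.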




To prove convergence, we require that the exploration policy executed during learn- 
ing be an ordered GLIE policy. An ordered policy is a policy that breaks Q-value 
ties among actions by preferring the action that comes first in some fixed ordering. 
A GLIE (greedy in the limit with infinite exploration) policy is a policy that (a) executes each action infinitely often in every
state that is visited infinitely often and (b) converges with probability 1 to a greedy 
policy. The ordering condition is required to ensure that the recursively optimal 
policy is unique. Without this condition, there are potentially many different re- 
cursively optimal policies with different values, depending on how ties are broken 
within subtasks, subsubtasks, and so on. 
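
One way to picture an ordered GLIE policy is a decaying epsilon-greedy rule with deterministic tie-breaking. The sketch below is an assumption-laden illustration rather than the exploration policy used in the paper: the function names and the 1/visits schedule are hypothetical, and the schedule is only one common choice for making exploration vanish in the limit while every action keeps being tried.

```python
import random


def ordered_greedy(q_values, actions):
    """Return a greedy action, breaking ties by position in `actions`."""
    best = max(q_values[a] for a in actions)
    for a in actions:              # the fixed ordering decides ties
        if q_values[a] == best:
            return a


def ordered_glie_action(q_values, actions, visits, rng=random):
    """Explore with probability 1/visits (decaying to 0), otherwise act greedily."""
    epsilon = 1.0 / max(1, visits)
    if rng.random() < epsilon:
        return rng.choice(actions)
    return ordered_greedy(q_values, actions)


q = {"North": 1.0, "South": 1.0, "East": 0.0}
print(ordered_greedy(q, ["North", "South", "East"]))  # North
print(ordered_greedy(q, ["South", "North", "East"]))  # South
```

The two calls differ only in the action ordering, which shows why the ordering has to be fixed in advance for the greedy limit to be unique.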

Theorem 1 Let M = (S, A, P, R) be either an episodic MDP for which all
deterministic policies are proper or a discounted infinite horizon MDP with discount
factor γ. Let H be a DAG defined over subtasks {M_0, ..., M_n}. Let α_t(i) > 0 be a
sequence of constants for each subtask M_i such that

    $$\lim_{T \to \infty} \sum_{t=1}^{T} \alpha_t(i) = \infty \quad \text{and} \quad \lim_{T \to \infty} \sum_{t=1}^{T} \alpha_t^2(i) < \infty \tag{1}$$

Let π_x(i, s) be an ordered GLIE policy at each subtask M_i and state s and assume
that |V_t(i, s)| and |C_t(i, s, a)| are bounded for all t, i, s, and a. Then with probability
1, algorithm MAXQ-Q converges to the unique recursively optimal policy for M
consistent with H and π_x.

Proof: (sketch) The proof is based on Proposition 4.5 from Bertsekas and
Tsitsiklis and follows the standard stochastic approximation argument,
generalized to the case of non-stationary noise. There are two key points in the
proof. First, define P_t(s', N|s, j) to be the probability transition function that
describes the behavior of executing the current policy for subtask j at time t. By an
inductive argument, we show that this probability transition function converges
(w.p. 1) to the probability transition function of the recursively optimal policy for j.
Second, we show how to convert the usual weighted max norm contraction for Q into a
weighted max norm contraction for C. This is straightforward, and completes the
proof.

Figure 1: Left: The Taxi Domain (taxi at row 3 column 0). Right: Task Graph.
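
As a standard illustration of the step-size conditions in Theorem 1 (this example is not taken from the paper), the harmonic schedule satisfies both requirements, since its sum diverges while the sum of its squares converges:

```latex
\alpha_t(i) = \frac{1}{t}: \qquad
\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad
\sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty .
```

A constant step size, by contrast, satisfies the first condition but violates the second.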

What is notable about MAXQ-Q is that it can learn the value functions of all 
subtasks simultaneously — it does not need to wait for the value function for subtask 
j to converge before beginning to learn the value function for its parent task i. This 
gives a completely online learning algorithm with wide applicability. 

3 Conditions for Safe State Abstraction 

To motivate state abstraction, consider the simple Taxi Task shown in Figure 1.
There are four special locations in this world, marked as R(ed), B(lue), G(reen),
and Y(ellow). In each episode, the taxi starts in a randomly-chosen square. There
is a passenger at one of the four locations (chosen randomly), and that passenger
wishes to be transported to one of the four locations (also chosen randomly). The
taxi must go to the passenger's location (the "source"), pick up the passenger, go 
to the destination location (the "destination"), and put down the passenger there. 
The episode ends when the passenger is deposited at the destination location. 

There are six primitive actions in this domain: (a) four navigation actions that 
move the taxi one square North, South, East, or West, (b) a Pickup action, and (c) 
a Putdown action. Each action is deterministic. There is a reward of -1 for each
action and an additional reward of +20 for successfully delivering the passenger.
There is a reward of -10 if the taxi attempts to execute the Putdown or Pickup
actions illegally. If a navigation action would cause the taxi to hit a wall, the action
is a no-op, and there is only the usual reward of -1.

This task has a hierarchical structure (see Fig. 1) in which there are two main
sub-tasks: Get the passenger (Get) and Deliver the passenger (Put). Each of these
subtasks in turn involves the subtask of navigating to one of the four locations
(Navigate(t), where t is bound to the desired target location) and then performing
a Pickup or Putdown action. This task illustrates the need to support both
temporal abstraction and state abstraction. The temporal abstraction is obvious: for
example, Get is a temporally extended action that can take different numbers of
steps to complete depending on the distance to the target. The top level policy (get
passenger; deliver passenger) can be expressed very simply with these abstractions.

The need for state abstraction is perhaps less obvious. Consider the Get subtask.
While this subtask is being solved, the destination of the passenger is completely
irrelevant: it cannot affect any of the navigation or pickup decisions. Perhaps more
importantly, when navigating to a target location (either the source or destination
location of the passenger), only the taxi's location and identity of the target location
are important. The fact that in some cases the taxi is carrying the passenger and
in other cases it is not is irrelevant.

We now introduce the five conditions for state abstraction. We will assume that the 
state s of the MDP is represented as a vector of state variables. A state abstraction 
can be defined for each combination of subtask Mi and child action j by identifying 
a subset X of the state variables that are relevant and defining the value function 
and the policy using only these relevant variables. Such value functions and policies 
are said to be abstract. 
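
In implementation terms, an abstraction of this kind amounts to projecting the full state onto the relevant variables before indexing a value table. The sketch below illustrates that idea only. The variable names in `STATE_VARS`, the `RELEVANT` map, and the subtask label "Navigate(R)" are hypothetical and are not taken from the paper.

```python
# Hypothetical state variables for a Taxi-like task.
STATE_VARS = ("taxi_row", "taxi_col", "passenger", "destination")

# (subtask, child action) -> the relevant subset X of the state variables.
RELEVANT = {
    ("Navigate(R)", "North"): ("taxi_row", "taxi_col"),
}


def abstract_key(subtask, child, state):
    """Project a full state (a dict) onto the variables relevant to (subtask, child)."""
    keep = RELEVANT.get((subtask, child), STATE_VARS)   # default: no abstraction
    return (subtask, child) + tuple(state[v] for v in keep)


s1 = {"taxi_row": 3, "taxi_col": 0, "passenger": "R", "destination": "G"}
s2 = {"taxi_row": 3, "taxi_col": 0, "passenger": "Y", "destination": "B"}

# Both states share one table entry, because they agree on the relevant variables.
print(abstract_key("Navigate(R)", "North", s1) == abstract_key("Navigate(R)", "North", s2))  # True
```

A table keyed this way stores one value per abstract state, which is where the savings in the number of stored values come from.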

The first two conditions involve eliminating irrelevant variables within a subtask of 
the MAXQ decomposition. 

Condition 1: Subtask Irrelevance. Let M_i be a subtask of MDP M. A set
of state variables Y is irrelevant to subtask i if the state variables of M can be
partitioned into two sets X and Y such that for any stationary abstract hierarchical
policy π executed by the descendants of M_i, the following two properties hold: (a)
the state transition probability distribution P^π(s', N|s, j) for each child action j of
M_i can be factored into the product of two distributions:

    $$P^\pi(x', y', N \mid x, y, j) = P^\pi(x', N \mid x, j) \cdot P^\pi(y' \mid y, j), \tag{2}$$

where x and x' give values for the variables in X, and y and y' give values for the
variables in Y; and (b) for any pair of states s_1 = (x, y_1) and s_2 = (x, y_2) and any
child action j, V^π(j, s_1) = V^π(j, s_2).

In the Taxi problem, the source and destination of the passenger are irrelevant to
the Navigate(t) subtask; only the target t and the current taxi position are relevant.

Condition 2: Leaf Irrelevance. A set of state variables Y is irrelevant for a
primitive action a if for any pair of states s_1 and s_2 that differ only in their values
for the variables in Y,

    $$\sum_{s'_1} P(s'_1 \mid s_1, a) R(s'_1 \mid s_1, a) = \sum_{s'_2} P(s'_2 \mid s_2, a) R(s'_2 \mid s_2, a).$$

This condition is satisfied by the primitive actions North, South, East, and West in 
the taxi task, where all state variables are irrelevant because R is constant. 
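
The leaf irrelevance condition can be checked numerically by comparing expected one-step rewards for two states that differ only in the supposedly irrelevant variables. The sketch below does this for a toy model of the North action. The state encoding, the tables `P` and `R`, and the assumption that North decreases the row index are all hypothetical; only the deterministic transition and the reward of -1 come from the task description.

```python
def expected_reward(P, R, s, a):
    """Sum over s' of P(s'|s,a) * R(s'|s,a); P[(s, a)] maps s' to its probability."""
    return sum(p * R[(s_next, s, a)] for s_next, p in P[(s, a)].items())


# States are (taxi_position, destination); s1 and s2 differ only in the destination.
s1 = ((3, 0), "G")
s2 = ((3, 0), "B")

# Hypothetical deterministic model of North: one square up, reward -1.
P = {
    (s1, "North"): {((2, 0), "G"): 1.0},
    (s2, "North"): {((2, 0), "B"): 1.0},
}
R = {
    (((2, 0), "G"), s1, "North"): -1.0,
    (((2, 0), "B"), s2, "North"): -1.0,
}

print(expected_reward(P, R, s1, "North") == expected_reward(P, R, s2, "North"))  # True
```

Because the two expectations agree, a single stored value V(North) can stand in for both states.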

The next two conditions involve "funnel" actions — macro actions that move the 
environment from some large number of possible states to a small number of re- 
sulting states. The completion function of such subtasks can be represented using 
a number of values proportional to the number of resulting states. 

Condition 3: Result Distribution Irrelevance (Undiscounted case.) A set
of state variables Y_j is irrelevant for the result distribution of action j if, for all
abstract policies π executed by M_j and its descendants in the MAXQ hierarchy, the
following holds: for all pairs of states s_1 and s_2 that differ only in their values for
the state variables in Y_j,

    $$\forall s' \quad P^\pi(s' \mid s_1, j) = P^\pi(s' \mid s_2, j).$$

Consider, for example, the Get subroutine under an optimal policy for the taxi
task. Regardless of the taxi's position in state s, the taxi will be at the passenger's
starting location when Get finishes executing (i.e., because the taxi will have just
completed picking up the passenger). Hence, the taxi's initial position is irrelevant
to its resulting position. (Note that this is only true in the undiscounted setting.
With discounting, the result distributions are not the same because the number of
steps N required for Get to finish depends very much on the starting location of the
taxi. Hence this form of state abstraction is rarely useful for cumulative discounted
reward.)

Condition 4: Termination. Let M_j be a child task of M_i with the property
that whenever M_j terminates, it causes M_i to terminate too. Then the completion
cost C(i, s, j) = 0 and does not need to be represented. This is a particular kind of
funnel action: it funnels all states into terminal states for M_i.

For example, in the Taxi task, in all states where the taxi is holding the passenger, 
the Put subroutine will succeed and result in a terminal state for Root. This is 
because the termination predicate for Put (i.e., that the passenger is at his or her 
destination location) implies the termination condition for Root (which is the same) . 
This means that C(Root, s, Put) is uniformly zero, for all states s where Put is not 
terminated. 

Condition 5: Shielding. Consider subtask M_j and let s be a state such that
for all paths from the root of the DAG down to M_j, there exists a subtask that is
terminated. Then no C values need to be represented for subtask M_j in state s,
because it can never be executed in s.

In the Taxi task, a simple example of this arises in the Put task, which is terminated 
in all states where the passenger is not in the taxi. This means that we do not need 
to represent C(Root, s, Put) in these states. The result is that, when combined 
with the Termination condition above, we do not need to explicitly represent the 
completion function for Put at all! 

By applying these abstraction conditions to the Taxi task, the value function can 
be represented using 632 values, which is much less than the 3,000 values required 
by flat Q learning. Without state abstractions, MAXQ requires 14,000 values! 
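
The figure of 3,000 values for flat Q learning can be reproduced by a short count. The sketch below assumes a 5x5 grid and a passenger who is either at one of the four special locations or in the taxi. Neither the grid size nor this encoding of the passenger variable is stated in the text above, so both should be read as assumptions. The counts of 632 and 14,000 depend on the details of the task graph and are not reproduced here.

```python
taxi_positions = 5 * 5        # assumed 5x5 grid
passenger_locations = 4 + 1   # four special locations, or inside the taxi (assumed encoding)
destinations = 4              # one of the four special locations
primitive_actions = 6         # North, South, East, West, Pickup, Putdown

states = taxi_positions * passenger_locations * destinations
flat_q_values = states * primitive_actions   # one Q value per (state, action) pair

print(states)         # 500
print(flat_q_values)  # 3000
```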

Theorem 2 (Convergence with State Abstraction) Let H be a MAXQ task
graph that incorporates the five kinds of state abstractions defined above. Let π_x be
an ordered GLIE exploration policy that is abstract. Then under the same
conditions as Theorem 1, MAXQ-Q converges with probability 1 to the unique recursively
optimal policy π* defined by π_x and H.

Proof: (sketch) Consider a subtask M_i with relevant variables X and two
arbitrary states (x, y_1) and (x, y_2). We first show that under the five abstraction
conditions, the value function of π* can be represented using C(i, x, j) (i.e.,
ignoring the y values). To learn the values of C(i, x, j) = Σ_{x',N} P(x', N|x, j) V(i, x'), a
Q-learning algorithm needs samples of x' and N drawn according to P(x', N|x, j).
The second part of the proof involves showing that regardless of whether we execute
j in state (x, y_1) or in (x, y_2), the resulting x' and N will have the same
distribution, and hence, give the correct expectations. Analogous arguments apply for leaf
irrelevance and V(a, x). The termination and shielding cases are easy.

4 Experimental Results 

We implemented MAXQ-Q for a noisy version of the Taxi domain and for
Kaelbling's HDG navigation task using Boltzmann exploration. Figure 2 shows the
performance of flat Q and MAXQ-Q with and without state abstractions on these
tasks. Learning rates and Boltzmann cooling rates were separately tuned to
optimize the performance of each method. The results show that without state
abstractions, MAXQ-Q learning is slower to converge than flat Q learning, but that with
state abstraction, it is much faster.

Figure 2: Comparison of MAXQ-Q with and without state abstraction to flat Q learning
on a noisy taxi domain (left) and Kaelbling's HDG task (right). The horizontal axis gives 
the number of primitive actions executed by each method. The vertical axis plots the 
average of 100 separate runs. 



5 Conclusion 



This paper has shown that by understanding the reasons that state variables are 
irrelevant, we can obtain a simple proof of the convergence of MAXQ-Q learning 
under state abstraction. This is much more fruitful than previous efforts based
only on weak notions of state aggregation, and it suggests that future research
should focus on identifying other conditions that permit safe state abstraction.